BT4211 Final Project: Credit Card Analysis (Group 10)¶

Done by:

  • Tham Shun Cong Aaron
  • Low Zi Yan Elicia
  • Nicolai Alexander Neo Rui Sheng
  • Cheang Jia Wei

Note: The cluster labels might change whenever a new code instance runs, but the cluster characteristics will remain the same¶

  • CUST_ID: Identification of Credit Cardholder (Categorical)

  • BALANCE (Available Credit): Difference between credit limit and purchases

After making a payment, you will have more available credit/balance to borrow again.

  • BALANCE_FREQUENCY: How frequently the Balance is updated.

How often the balance is increased or decreased. Ranges between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

  • PURCHASES: Cumulative amount of purchases made from the account (ONEOFF + INSTALLMENTS)

Total amount of money spent to purchase stuff with that Account.

  • ONEOFF_PURCHASES: Records the maximum purchase amount made in one go (one-time payment); it doesn't accumulate!

When a customer makes a single purchase of a product/service via a one-time payment

  • INSTALLMENTS_PURCHASES: Cumulative amount of purchases made in installments

Total amount of money spent for purchases using Installments

  • CASH_ADVANCE: Cumulative amount of cash advances taken by the customer

A type of credit card transaction in which cardholders withdraw cash against their credit limit. Cash advances usually carry high transaction fees and a high APR (annual percentage rate)

  • PURCHASES_FREQUENCY: How frequently purchases are being made.

How often purchases are made. Ranges between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

  • ONEOFF_PURCHASES_FREQUENCY: How frequently purchases happen in one go.

How often purchases are paid via one-time payment. Ranges between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

  • PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are made.

How often installment purchases are made. Ranges between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

  • CASH_ADVANCE_FREQUENCY: How frequently cash advances are taken

How often the customer uses cash advances

  • CASH_ADVANCE_TRX: Cumulative number of transactions made with “Cash Advance”.

Total count of cash advances (withdrawing extra cash) made on the credit card line. Of the total number of times a customer used the card, how many were cash advances.

  • PURCHASES_TRX: Cumulative number of Purchase transactions made.

Total count of Purchases made.

  • CREDIT LIMIT: Limit of Credit Card for user

Maximum amount that a person can spend on his/her credit card

  • PAYMENTS: Amount of Payment done by the user

Amount of money that the customer pays to their credit card company to repay what they have borrowed through purchases, cash advances, or any fees/interest charges.

  • MINIMUM_PAYMENTS: Cumulative minimum amount of payments made by the user

Lowest amount of money that a cardholder must pay to their credit card company each billing cycle.

  • PRC_FULL_PAYMENT: Percent of full payment paid by the user

Percentage of the Amount of Money owed to the credit card company paid by the user.

  • TENURE: Tenure of credit card service for user

How long the customer has been with the credit card issuer.

1. Import Libraries¶

In [231]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
import plotly.graph_objects as go
import statsmodels.api as sm
import itertools
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats

from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import SilhouetteVisualizer
from scipy.cluster.hierarchy import dendrogram, linkage
In [232]:
from sklearn.preprocessing import StandardScaler, PowerTransformer, OrdinalEncoder, RobustScaler, MinMaxScaler, Normalizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.neighbors import NearestNeighbors
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.manifold import TSNE
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, FactorAnalyzer, calculate_kmo
from sklearn.ensemble import IsolationForest
from sklearn import tree
from kneed import KneeLocator
import pingouin as pg
In [233]:
#SKlearn & other ML Packages
import shap
from sklearn.metrics import r2_score, classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.metrics import homogeneity_score, silhouette_score, log_loss, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.inspection import permutation_importance
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
In [234]:
import warnings
warnings.filterwarnings(action="ignore")
from mpl_toolkits.mplot3d import Axes3D

2. Data¶

In [235]:
df = pd.read_csv('CC GENERAL.csv')
df.head()
Out[235]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12
4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12
In [236]:
df.shape
Out[236]:
(8950, 18)
In [237]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
In [238]:
df.describe()
Out[238]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318
std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000
25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000
50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000
75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000
max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000

2.1 Data Cleaning & Handling Missing Values¶

In [239]:
#Check for missing/NULL values
df.isna().sum()
Out[239]:
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64
In [240]:
corr = df.corr(numeric_only = True) #CUST_ID is non-numeric, so exclude it explicitly
plt.figure(figsize = (12, 8))
sns.heatmap(corr, annot = True, cmap='YlGnBu', linewidths=3, fmt=".2f", mask = np.triu(corr))
plt.show()

2.1.1 Linear Regression Imputation for Missing Data¶

From the missing-data list, the feature 'MINIMUM_PAYMENTS' has a relatively high number of missing values, so it would not be ideal to drop those entries. Instead, linear regression imputation is applied to fill the missing data, so as not to skew the feature distributions or introduce bias. The methodology is to fit a linear regression model that patches the missing data using correlated explanatory variables. After forming the correlation map, explanatory variables were chosen for each response variable based on their linear correlation; here the top two variables with the highest correlation are used, since most of the other features have relatively low correlation.
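The mechanics in miniature (a hedged sketch on a toy frame with made-up columns `a`, `b`, `y`, not the project data): fit the regression on the rows where the target is present, then predict only the missing rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame (hypothetical columns, not the project data): y = 2*a + b exactly
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.0, 3.0, 2.0, 5.0, 4.0],
    'y': [3.0, 7.0, 8.0, np.nan, 14.0],
})

# Fit on the rows where the target is present
known = toy.dropna(subset=['y'])
model = LinearRegression().fit(known[['a', 'b']], known['y'])

# Predict only the rows where the target is missing
mask = toy['y'].isna()
toy.loc[mask, 'y'] = model.predict(toy.loc[mask, ['a', 'b']])
print(toy['y'].tolist())
```

Unlike mean imputation, this keeps the filled values consistent with the relationships between the target and its correlated features.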

In [241]:
#May take a while to patch the NULL values
missing_features = ['MINIMUM_PAYMENTS']
lr_mapping = {}
for i in missing_features:
    #Top 2 most-correlated features (position 0 is the feature itself)
    exp_var = corr.unstack()[i].sort_values(kind = 'quicksort', ascending = False)[1:3].index
    lr_mapping[i] = exp_var

for key, val in lr_mapping.items():
    #Fit the regression on rows where the target is present
    df_temp = df.dropna(subset = [key], axis = 0)
    X = df_temp[val]
    y = df_temp[key]

    imp_model = LinearRegression()
    imp_model.fit(X, y)

    #Predict only the rows where the target is missing
    missing_mask = df[key].isna()
    df.loc[missing_mask, key] = imp_model.predict(df.loc[missing_mask, val]).round(1)

2.1.2 Handling the Remaining Missing Data¶

In [242]:
#Since all CUST_ID are truly unique, we can drop the column on cust_id
df = df.drop(['CUST_ID'], axis = 1)

#Patching the null values
df = df.dropna(subset = ['CREDIT_LIMIT'])
In [243]:
#check for dupes
df.duplicated().sum()
Out[243]:
0

2.1.3. Feature Engineering¶

In [244]:
#Monthly Average Purchase over the period the customer has been using the card
df['avg_monthly_purchase'] = df['PURCHASES']/df['TENURE']
                                  
#Monthly Average Cash Advance over the period the customer has been using the card
df['avg_monthly_cash'] = df['CASH_ADVANCE']/df['TENURE']

#Monthly Average Payments over the period the customer has been using the card
df['avg_monthly_installments'] = df['INSTALLMENTS_PURCHASES']/df['TENURE']

#Utilization rate of Credit Card
df['utilisation_ratio'] = (df['CREDIT_LIMIT'] - df['BALANCE'])/df['CREDIT_LIMIT']

#Based on the type of purchases, we can classify or categorise it according to type of purchase they made.
df['type_of_purchase'] = np.where((df['INSTALLMENTS_PURCHASES'] > 0) & (df['ONEOFF_PURCHASES'] == 0), 'Installment',
                        np.where((df['INSTALLMENTS_PURCHASES'] == 0) & (df['ONEOFF_PURCHASES'] > 0), 'One_off',
                        np.where((df['INSTALLMENTS_PURCHASES'] > 0) & (df['ONEOFF_PURCHASES'] > 0), 'Both', 'Cash')))
In [245]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8949 entries, 0 to 8949
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   BALANCE                           8949 non-null   float64
 1   BALANCE_FREQUENCY                 8949 non-null   float64
 2   PURCHASES                         8949 non-null   float64
 3   ONEOFF_PURCHASES                  8949 non-null   float64
 4   INSTALLMENTS_PURCHASES            8949 non-null   float64
 5   CASH_ADVANCE                      8949 non-null   float64
 6   PURCHASES_FREQUENCY               8949 non-null   float64
 7   ONEOFF_PURCHASES_FREQUENCY        8949 non-null   float64
 8   PURCHASES_INSTALLMENTS_FREQUENCY  8949 non-null   float64
 9   CASH_ADVANCE_FREQUENCY            8949 non-null   float64
 10  CASH_ADVANCE_TRX                  8949 non-null   int64  
 11  PURCHASES_TRX                     8949 non-null   int64  
 12  CREDIT_LIMIT                      8949 non-null   float64
 13  PAYMENTS                          8949 non-null   float64
 14  MINIMUM_PAYMENTS                  8949 non-null   float64
 15  PRC_FULL_PAYMENT                  8949 non-null   float64
 16  TENURE                            8949 non-null   int64  
 17  avg_monthly_purchase              8949 non-null   float64
 18  avg_monthly_cash                  8949 non-null   float64
 19  avg_monthly_installments          8949 non-null   float64
 20  utilisation_ratio                 8949 non-null   float64
 21  type_of_purchase                  8949 non-null   object 
dtypes: float64(18), int64(3), object(1)
memory usage: 1.6+ MB
In [246]:
df.columns
Out[246]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio', 'type_of_purchase'],
      dtype='object')

2.2 Quantile Statistics¶

In [247]:
# Quantile Statistics
num_features = ['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio']
quantile = df[num_features].describe().loc[["min", "25%", "50%", "75%", "max"]].transpose()
quantile.columns = ["min", "Q1", "Q2", "Q3", "max"]
quantile["range"] = quantile["max"] - quantile["min"]
quantile["IQR"] = quantile["Q3"] - quantile["Q1"]
quantile
Out[247]:
min Q1 Q2 Q3 max range IQR
BALANCE 0.000000 128.365782 873.680279 2054.372848 19043.138560 19043.138560 1926.007066
BALANCE_FREQUENCY 0.000000 0.888889 1.000000 1.000000 1.000000 1.000000 0.111111
PURCHASES 0.000000 39.800000 361.490000 1110.170000 49039.570000 49039.570000 1070.370000
ONEOFF_PURCHASES 0.000000 0.000000 38.000000 577.830000 40761.250000 40761.250000 577.830000
INSTALLMENTS_PURCHASES 0.000000 0.000000 89.000000 468.650000 22500.000000 22500.000000 468.650000
CASH_ADVANCE 0.000000 0.000000 0.000000 1113.868654 47137.211760 47137.211760 1113.868654
PURCHASES_FREQUENCY 0.000000 0.083333 0.500000 0.916667 1.000000 1.000000 0.833334
ONEOFF_PURCHASES_FREQUENCY 0.000000 0.000000 0.083333 0.300000 1.000000 1.000000 0.300000
PURCHASES_INSTALLMENTS_FREQUENCY 0.000000 0.000000 0.166667 0.750000 1.000000 1.000000 0.750000
CASH_ADVANCE_FREQUENCY 0.000000 0.000000 0.000000 0.222222 1.500000 1.500000 0.222222
CASH_ADVANCE_TRX 0.000000 0.000000 0.000000 4.000000 123.000000 123.000000 4.000000
PURCHASES_TRX 0.000000 1.000000 7.000000 17.000000 358.000000 358.000000 16.000000
CREDIT_LIMIT 50.000000 1600.000000 3000.000000 6500.000000 30000.000000 29950.000000 4900.000000
PAYMENTS 0.000000 383.282850 857.062706 1901.279320 50721.483360 50721.483360 1517.996470
MINIMUM_PAYMENTS -167.100000 166.000000 299.351881 805.478131 76406.207520 76573.307520 639.478131
PRC_FULL_PAYMENT 0.000000 0.000000 0.000000 0.142857 1.000000 1.000000 0.142857
TENURE 6.000000 12.000000 12.000000 12.000000 12.000000 6.000000 0.000000
avg_monthly_purchase 0.000000 3.422500 31.940000 97.230000 4086.630833 4086.630833 93.807500
avg_monthly_cash 0.000000 0.000000 0.000000 99.107730 3928.100980 3928.100980 99.107730
avg_monthly_installments 0.000000 0.000000 7.737143 40.710000 1875.000000 1875.000000 40.710000
utilisation_ratio -14.909951 0.282418 0.697130 0.958473 1.000000 15.909951 0.676055

2.3 Descriptive Statistics¶

In [248]:
#General Descriptive Statistics for Numerical Features
descriptive_stats = df[num_features].describe().loc[["mean", "std", "50%"]].transpose()
descriptive_stats.columns = ["mean", "std", "median"]
descriptive_stats["mode"] = df[num_features].mode().transpose()[0]
descriptive_stats["mad"] = [stats.median_abs_deviation(df[x]) for x in num_features]
descriptive_stats["kurtosis"] = [stats.kurtosis(df[x], fisher=False) for x in num_features]
descriptive_stats["excess kurtosis"] = [stats.kurtosis(df[x], fisher=True) for x in num_features]
descriptive_stats["skewness"] = [stats.skew(df[x]) for x in num_features]
descriptive_stats
Out[248]:
mean std median mode mad kurtosis excess kurtosis skewness
BALANCE 1564.647593 2081.584016 873.680279 0.0 800.045248 10.669089 7.669089 2.392869
BALANCE_FREQUENCY 0.877350 0.236798 1.000000 1.0 0.000000 6.095206 3.095206 -2.023854
PURCHASES 1003.316936 2136.727848 361.490000 0.0 361.490000 114.317027 111.317027 8.142604
ONEOFF_PURCHASES 592.503572 1659.968851 38.000000 0.0 38.000000 167.079674 164.079674 10.042938
INSTALLMENTS_PURCHASES 411.113579 904.378205 89.000000 0.0 89.000000 99.512552 96.512552 7.297600
CASH_ADVANCE 978.959616 2097.264344 0.000000 0.0 0.000000 55.863880 52.863880 5.165457
PURCHASES_FREQUENCY 0.490405 0.401360 0.500000 1.0 0.416667 1.361634 -1.638366 0.059960
ONEOFF_PURCHASES_FREQUENCY 0.202480 0.298345 0.083333 0.0 0.083333 4.160000 1.160000 1.535196
PURCHASES_INSTALLMENTS_FREQUENCY 0.364478 0.397451 0.166667 0.0 0.166667 1.601313 -1.398687 0.508938
CASH_ADVANCE_FREQUENCY 0.135141 0.200132 0.000000 0.0 0.000000 6.331658 3.331658 1.828338
CASH_ADVANCE_TRX 3.249078 6.824987 0.000000 0.0 0.000000 64.605263 61.605263 5.720017
PURCHASES_TRX 14.711476 24.858552 7.000000 0.0 7.000000 37.770493 34.770493 4.629717
CREDIT_LIMIT 4494.449450 3638.815725 3000.000000 3000.0 1800.000000 5.834401 2.834401 1.522209
PAYMENTS 1733.336511 2895.168146 857.062706 0.0 581.375627 57.736011 54.736011 5.906475
MINIMUM_PAYMENTS 847.527175 2335.006965 299.351881 157.8 180.111631 295.404391 292.404391 13.809641
PRC_FULL_PAYMENT 0.153732 0.292511 0.000000 0.0 0.000000 5.429630 2.429630 1.942316
TENURE 11.517935 1.337134 12.000000 12.0 0.000000 10.702409 7.702409 -2.944294
avg_monthly_purchase 86.184802 180.516574 31.940000 0.0 31.940000 110.731438 107.731438 8.002899
avg_monthly_cash 88.984447 193.145939 0.000000 0.0 0.000000 47.151336 44.151336 4.943233
avg_monthly_installments 35.283977 76.593118 7.737143 0.0 7.737143 96.418009 93.418009 7.184574
utilisation_ratio 0.611074 0.389722 0.697130 1.0 0.283333 282.479014 279.479014 -7.416309
In [249]:
#As observed, there is a large number of outliers and the distributions aren't normal, since they are highly skewed
fig= plt.subplots(nrows = 7 , ncols = 3, figsize=(15,15))
for i in range(len(df[num_features].columns)):
    plt.subplot(7, 3, i+1)
    ax=  sns.boxplot(x = df[num_features][df[num_features].columns[i]])
    plt.title(df[num_features].columns[i])
    plt.tight_layout()

2.4 Handling Outliers (Isolation Forest)¶

From the quantile and descriptive statistics, the distributions in the dataset are extremely skewed (not normal) and there are huge variances in the feature ranges. Hence there are a lot of outliers, as seen in the earlier box plots. To combat this, Isolation Forest is used, as it is computationally efficient and does not assume a normal distribution. Additionally, it can detect outliers at a multi-dimensional level.

Isolation Forest is based on the concept of using a random forest of decision trees to isolate outliers from the rest of the data. It works by randomly selecting a feature and a split value at each node of a decision tree, and continues splitting until every point has been isolated. The idea is that outliers require fewer splits to isolate, since they lie further away from the majority of the data points; outliers therefore have shorter path lengths in the trees than normal data points, which makes them easier to identify. After identifying these outliers, the data points are dropped for easier segmentation of customers in the following sections.
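The shorter-path-length intuition can be sketched on synthetic data (a hypothetical two-feature cluster, separate from this dataset): the isolated point receives a lower anomaly score than the cluster members.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# A tight cluster of "normal" points plus one obvious outlier
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([inliers, outlier])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)

# decision_function: higher = more normal, lower = more anomalous;
# the isolated point takes far fewer random splits to separate
scores = model.decision_function(X)
print(scores[-1], scores[:-1].mean())
```

The same mechanism, with `contamination=0.1`, drives the outlier removal applied to the full feature set below.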

In [250]:
features = ['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT', 'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       #'trx_monthly', 'cash_advance_trx_monthly', 'purchases_trx_monthly', 
            'avg_monthly_installments', 'utilisation_ratio']

clean_df = df[features].copy()
model = IsolationForest(n_estimators=150, max_samples='auto', contamination=float(0.1), max_features=1.0)
model.fit(clean_df)
Out[250]:
IsolationForest(contamination=0.1, n_estimators=150)
In [251]:
# Adding 'scores' and 'anomaly' columns to clean_df
scores = model.decision_function(clean_df)
anomaly = model.predict(clean_df) # -1 = outlier, 1 = inlier
clean_df['scores'] = scores
clean_df['anomaly'] = anomaly

outliers = clean_df.loc[clean_df['anomaly'] == -1]
outliers_index = list(outliers.index)
anomaly_index = list(outliers_index) # same rows; name kept for later cells

print('Num of Outliers:', len(outliers_index))
Num of Outliers: 895
In [252]:
# The PCA result below is positionally indexed (0..n-1), so drop the last
# row label if it was flagged, to avoid an out-of-range position
if 8949 in outliers_index:
    outliers_index.remove(8949)
In [253]:
pca = PCA(n_components = 2, random_state = 0)
pca.fit(clean_df)
res = pd.DataFrame(pca.transform(clean_df))
Z = np.array(res)

plt.title("IsolationForest")
plt.contourf(Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(res[0], res[1], c='green', s = 20,label = "normal points")
b1 = plt.scatter(res.iloc[outliers_index, 0], res.iloc[outliers_index, 1], 
                 c = 'red', s = 20, edgecolor = "red", label = "predicted outliers")
plt.legend(loc = "upper right")
plt.show()
In [254]:
%matplotlib inline

pca = PCA(n_components=3)  # Reduce to k=3 dimensions
scaler = StandardScaler()
#normalize the metrics
X = scaler.fit_transform(clean_df)
X_reduce = pca.fit_transform(X)
fig = plt.figure()
ax = fig.add_subplot(projection = '3d') # Axes3D(fig) no longer attaches to the figure in newer matplotlib
ax.set_zlabel("x_composite_3")
# Plot the compressed data points
ax.scatter(X_reduce[:, 0], X_reduce[:, 1], zs=X_reduce[:, 2], s=4, lw=1, label="inliers",c="green")
# Plot x's for the predicted outliers
ax.scatter(X_reduce[outliers_index, 0],X_reduce[outliers_index,1], X_reduce[outliers_index,2],
           lw=2, s=60, marker="x", c="red", label="outliers")
ax.legend()

ax.view_init(azim = -230, elev = 25)
plt.show()
In [255]:
# dropping outliers
clean_df = clean_df.drop(anomaly_index, axis = 0).reset_index(drop=True)

3. Exploratory Data Analysis (EDA)¶

In [256]:
df.columns
Out[256]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio', 'type_of_purchase'],
      dtype='object')

3.1 General Features Distribution & Correlations¶

In [257]:
fig= plt.subplots(nrows = 9, ncols = 3, figsize=(15, 15))
for i in range (len(clean_df.columns)):
    plt.subplot(9, 3, i+1)
    ax =  sns.boxplot(x = clean_df[clean_df.columns[i]])
    plt.title(clean_df.columns[i])
    plt.tight_layout()
In [258]:
plt.figure(figsize = (15, 40))

for i in range(len(clean_df.columns)):
    plt.subplot(10, 3, i+1)
    sns.histplot(clean_df[clean_df.columns[i]] , kde = True)
    plt.title(clean_df.columns[i])
    
plt.tight_layout()

Observation:

  • Most credit card holders have a low credit limit and maintain a credit balance roughly below 7,500.
  • Variables such as PURCHASES, ONEOFF_PURCHASES, INSTALLMENTS_PURCHASES and CASH_ADVANCE follow the same trend as the credit balance, so they could all be related: when the credit balance is low, purchases also tend to be low, and so on.
  • Most people either don't purchase anything (opting for cash advances instead) or purchase very frequently.
  • More people purchase in installments than in one go.
  • In the last 6 months, most people have made total payments below 10,000, with minimum payments below 5,000.
  • Finally, most credit card holders have held their card for the full 12 months.
In [259]:
clean_corr = clean_df.corr()
plt.figure(figsize = (12, 8))

sns.heatmap(clean_corr, annot = True, cmap=sns.diverging_palette(220, 10, as_cmap=True), 
            linewidths=3, fmt=".2f", mask = np.triu(clean_corr))
plt.show()
  1. Strong Correlations
  • Purchases and One-off Purchases
  • Purchases and Purchase Transactions
  • Cash advance and Cash advance transactions
  • Purchase Frequency and Purchase Instalments Frequency
  • Instalment Purchases and Purchases Transactions
  2. Medium Correlations:
  • Balance and Cash Advance
  • Balance and Cash Advance Frequency
  • Balance and Credit Limit
  • Payments and Cash Advance
  • Payments and Credit Limit

3.2 Continuous Variables Exploration¶

In [260]:
#Visualising the distribution density of the purchase-amount variables, grouped by type of purchase
viz_hist_cols = ['PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES','CASH_ADVANCE']

for col in viz_hist_cols:
    sns.histplot(data=df, x=col, bins=10, hue="type_of_purchase", element="step",stat="density", common_norm=False).set(title=f"Distribution of {col} grouped by Type of Purchase")
    plt.show()
In [261]:
#Visualising the distribution density of the frequency variables, grouped by type of purchase
viz_kde_cols = ['PURCHASES_FREQUENCY','ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY','CASH_ADVANCE_FREQUENCY']

for col in viz_kde_cols:
    sns.kdeplot(data=df, x=col, hue="type_of_purchase").set(title=f"Distribution of {col} grouped by Type of Purchase")
    plt.show()
In [262]:
frequency_features = ['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY','ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 
                      'CASH_ADVANCE_FREQUENCY','PRC_FULL_PAYMENT']

df[frequency_features].plot.kde(figsize=(8,5), bw_method=3) #,ind=[0, 2, 3,4]
plt.show()

3.3 Joint Distribution Plots¶

In [263]:
clean_df.columns
Out[263]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio', 'scores', 'anomaly'],
      dtype='object')
In [264]:
joint_var = ['PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS']
plt.figure(figsize = (20,16))
joint_count = 0

colors = itertools.cycle(["r", "b", "g"])

for i, c in zip(joint_var, colors):
    plt.subplot(2, 3, joint_count + 1)
    joint_count += 1
    plt.scatter(clean_df['BALANCE'], clean_df[i], alpha = 0.75, color = c)
    plt.title(f'BALANCE vs {i}')
    plt.xlabel('BALANCE')
    plt.ylabel(i)
In [265]:
purchases_types = ['ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PAYMENTS', 'CREDIT_LIMIT']
plt.figure(figsize = (20,16))
joint_count = 0

colors = itertools.cycle(["r", "b", "g"])

for i, c in zip(purchases_types, colors):
    plt.subplot(2, 3, joint_count + 1)
    joint_count += 1
    plt.scatter(clean_df['PURCHASES'], clean_df[i], alpha = 0.75, color = c)
    plt.title(f'PURCHASES vs {i}')
    plt.xlabel('PURCHASES')
    plt.ylabel(i)
In [266]:
cl_joint = ['BALANCE', 'PURCHASES', 'PAYMENTS', 'TENURE', 'CASH_ADVANCE']
plt.figure(figsize = (20,16))
joint_count = 0

colors = itertools.cycle(["r", "b", "g"])

for i, c in zip(cl_joint, colors):
    plt.subplot(2, 3, joint_count + 1)
    joint_count += 1
    plt.scatter(clean_df['CREDIT_LIMIT'], clean_df[i], color = c)
    plt.title(f'CREDIT_LIMIT vs {i}')
    plt.xlabel('CREDIT_LIMIT')
    plt.ylabel(i)
In [267]:
tenure_joint = ['PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY']

for i in tenure_joint:
    sns.jointplot(x = 'TENURE', y = i, data = clean_df)
    plt.show()

3.4 Categorical Features Exploration¶

In this section, the variable 'type_of_purchase' is mainly used to surface potential trends. The feature was created to give an overview of the types of users in the dataset, since the data contains no categorical features (except perhaps TENURE). However, this variable will not be used in the models or clustering in later sections, as it may cloud the results: the feature acts like a label and could interfere with the unsupervised learning process.

In [268]:
sns.countplot(data = df, x = 'type_of_purchase')
plt.title('Counts of Type of Credit Card Purchases')
plt.show()
In [269]:
#Most of these cardholders already seem to have long tenure periods.
df.groupby('type_of_purchase')['TENURE'].value_counts().unstack()
Out[269]:
TENURE             6   7   8   9  10   11    12
type_of_purchase
Both              29  30  36  28  51   81  2519
Cash              66  61  62  57  71  125  1599
Installment       64  56  47  55  54   73  1911
One_off           44  43  51  35  60   86  1555
In [270]:
plt.figure(figsize = (20,10))
tenure_counts = df['TENURE'].value_counts(dropna = False).sort_index() #sort by tenure so x and y stay aligned
sns.barplot(x = tenure_counts.index, y = tenure_counts.values)
plt.xlabel('Tenure')
plt.ylabel('Customers count')
plt.title('Customer count per Tenure')
plt.show()
#Looks like most customers in the past 6 months are already long-term customers.
In [271]:
# Visualising spendings on type of purchase
spending_features = ['avg_monthly_purchase', 'avg_monthly_cash',
                     'avg_monthly_installments']

for col in spending_features:
    sns.barplot(data = df, x = 'type_of_purchase', y = col).set(title=f" {col} per Type of Purchase")
    plt.show()
    
# Based on these plots, customers of every type_of_purchase tag appear to use Cash Advance, as seen in the 2nd chart.
In [272]:
# Visualising spendings on tenure duration
for col in spending_features:
    sns.barplot(data = df, x = 'TENURE', y = col).set(title=f" {col} per TENURE")
    plt.show()

#Interestingly, people with shorter tenure also tend to use a higher amount of Cash Advance.

3.5 Factor Analysis (Determining Variance in Features)¶

In [273]:
clean_df.drop(columns = ['scores', 'anomaly'], inplace = True)
x = clean_df.copy()
chi_square_value, p_value = calculate_bartlett_sphericity(x)
chi_square_value, p_value #statistically significant
Out[273]:
(264514.9893792681, 0.0)

The KMO test measures the proportion of variance among the variables that might be common variance. Higher proportions indicate stronger correlations among the variables, making dimensionality-reduction techniques such as Factor Analysis more appropriate. The KMO score ranges from 0 to 1, and values above 0.6 are generally considered adequate.
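For intuition, the overall KMO statistic can be computed by hand from the correlation matrix: it compares squared correlations against squared partial correlations, the latter obtained from the inverse of the correlation matrix. A self-contained sketch on synthetic data (the helper `kmo_overall` is illustrative; the notebook itself uses `factor_analyzer`'s `calculate_kmo`):

```python
import numpy as np

def kmo_overall(corr):
    """Overall KMO: squared correlations vs. squared partial correlations."""
    inv = np.linalg.inv(corr)                          # precision matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale                             # partial correlation matrix
    np.fill_diagonal(partial, 0.0)
    r = corr.copy()
    np.fill_diagonal(r, 0.0)
    r2 = (r ** 2).sum()
    return r2 / (r2 + (partial ** 2).sum())

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
data = latent + 0.5 * rng.normal(size=(500, 4))        # 4 correlated indicators
score = kmo_overall(np.corrcoef(data, rowvar=False))   # high: shared latent factor
```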

In [274]:
kmo_all,kmo_model=calculate_kmo(x)
print(kmo_model, len(kmo_all)) 
0.7228304745057077 21
In [275]:
fa = FactorAnalyzer()
fa.fit(x) #fit() ignores a second positional argument (y); the number of factors is set in the constructor

#Get Eigen values and plot them
ev, v = fa.get_eigenvalues()

#plot eigenvector values
plt.scatter(range(1, x.shape[1] + 1),ev)
plt.plot(range(1, x.shape[1] + 1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.show()
#Use 3 factors, since the eigenvalues drop off sharply after the third factor
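Besides the scree-plot elbow, the Kaiser criterion (retain factors with eigenvalue greater than 1) is a common cross-check. A minimal sketch on synthetic data with two known latent factors:

```python
import numpy as np

rng = np.random.default_rng(0)
# two latent factors driving six observed variables
f = rng.normal(size=(1000, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.0], [0.8, 0.0],
                     [0.0, 1.0], [0.0, 0.9], [0.0, 0.8]])
obs = f @ loadings.T + 0.3 * rng.normal(size=(1000, 6))

# eigenvalues of the correlation matrix, descending
eigvals = np.linalg.eigvalsh(np.corrcoef(obs, rowvar=False))[::-1]
n_keep = int((eigvals > 1.0).sum())   # Kaiser criterion recovers both factors
```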
In [276]:
fa = FactorAnalyzer(n_factors=3,rotation ='varimax')
fa.fit(x)
loads = fa.loadings_
print(pd.DataFrame(fa.loadings_, index = x.columns))
                                         0         1         2
BALANCE                          -0.076213  0.693062  0.134530
BALANCE_FREQUENCY                 0.183959  0.292503  0.176052
PURCHASES                         0.473498 -0.109829  0.845396
ONEOFF_PURCHASES                 -0.034486 -0.110229  0.970549
INSTALLMENTS_PURCHASES            0.910120 -0.052828  0.193477
CASH_ADVANCE                     -0.153736  0.852126 -0.069621
PURCHASES_FREQUENCY               0.689618 -0.282352  0.326847
ONEOFF_PURCHASES_FREQUENCY        0.044330 -0.128597  0.723804
PURCHASES_INSTALLMENTS_FREQUENCY  0.827135 -0.203077  0.054693
CASH_ADVANCE_FREQUENCY           -0.204512  0.764340 -0.120368
CASH_ADVANCE_TRX                 -0.161465  0.736976 -0.098099
PURCHASES_TRX                     0.589996 -0.102942  0.517114
CREDIT_LIMIT                      0.040000  0.233092  0.304076
PAYMENTS                          0.108932  0.330950  0.338249
MINIMUM_PAYMENTS                  0.017809  0.252460  0.002470
PRC_FULL_PAYMENT                  0.237731 -0.296473  0.061291
TENURE                            0.064622  0.008373  0.129967
avg_monthly_purchase              0.468078 -0.117571  0.833603
avg_monthly_cash                 -0.159306  0.825895 -0.084073
avg_monthly_installments          0.914345 -0.059544  0.179823
utilisation_ratio                 0.106741 -0.448484  0.049690
In [277]:
#Factor 1: 
factor_1 = pd.DataFrame(fa.loadings_, index=x.columns)
factor_1 = factor_1[factor_1[0] > 0.4]
print("Factor 1: ", factor_1.index)

#Factor 2:
factor_2 = pd.DataFrame(fa.loadings_, index=x.columns)
factor_2 = factor_2[factor_2[1] > 0.4]
print("Factor 2: ", factor_2.index)

#Factor 3: 
factor_3 = pd.DataFrame(fa.loadings_, index=x.columns)
factor_3 = factor_3[factor_3[2] > 0.4]
print("Factor 3: ", factor_3.index)
Factor 1:  Index(['PURCHASES', 'INSTALLMENTS_PURCHASES', 'PURCHASES_FREQUENCY',
       'PURCHASES_INSTALLMENTS_FREQUENCY', 'PURCHASES_TRX',
       'avg_monthly_purchase', 'avg_monthly_installments'],
      dtype='object')
Factor 2:  Index(['BALANCE', 'CASH_ADVANCE', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX',
       'avg_monthly_cash'],
      dtype='object')
Factor 3:  Index(['PURCHASES', 'ONEOFF_PURCHASES', 'ONEOFF_PURCHASES_FREQUENCY',
       'PURCHASES_TRX', 'avg_monthly_purchase'],
      dtype='object')
In [278]:
#Create the factors
factor1 = x[factor_1.index]
factor2 = x[factor_2.index]
factor3 = x[factor_3.index]

#Get cronbach alpha
factor1_alpha = pg.cronbach_alpha(factor1)
factor2_alpha = pg.cronbach_alpha(factor2)
factor3_alpha = pg.cronbach_alpha(factor3)
print(factor1_alpha, factor2_alpha, factor3_alpha)
#Factor 3 has the highest internal consistency; factors 1 and 2 are weaker (alpha below 0.6)
(0.5281596976849577, array([0.512, 0.544])) (0.4763970245588475, array([0.458, 0.494])) (0.6299935718419716, array([0.617, 0.643]))
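Cronbach's alpha, as computed by pingouin above, follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of item totals). A minimal numpy version for intuition:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()     # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of the item totals
    return (k / (k - 1)) * (1 - item_vars / total_var)

# two perfectly correlated items -> alpha = 1
perfect = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
alpha = cronbach_alpha(perfect)
```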
In [279]:
print(pd.DataFrame(fa.get_factor_variance(),index=['Variance','Proportional Var','Cumulative Var']))
                         0         1         2
Variance          3.861044  3.810632  3.637561
Proportional Var  0.183859  0.181459  0.173217
Cumulative Var    0.183859  0.365318  0.538535
In [280]:
# Communality is the proportion of each variable’s variance 
# that can be explained by the factors. Rotations don’t have any influence over the communality of the variables.
print(pd.DataFrame(fa.get_communalities(), index=x.columns, columns=['Communalities']))
                                  Communalities
BALANCE                                0.504242
BALANCE_FREQUENCY                      0.150393
PURCHASES                              0.950957
ONEOFF_PURCHASES                       0.955306
INSTALLMENTS_PURCHASES                 0.868543
CASH_ADVANCE                           0.754601
PURCHASES_FREQUENCY                    0.662124
ONEOFF_PURCHASES_FREQUENCY             0.542394
PURCHASES_INSTALLMENTS_FREQUENCY       0.728383
CASH_ADVANCE_FREQUENCY                 0.640529
CASH_ADVANCE_TRX                       0.578828
PURCHASES_TRX                          0.626099
CREDIT_LIMIT                           0.148394
PAYMENTS                               0.235806
MINIMUM_PAYMENTS                       0.064059
PRC_FULL_PAYMENT                       0.148169
TENURE                                 0.021138
avg_monthly_purchase                   0.927814
avg_monthly_cash                       0.714549
avg_monthly_installments               0.871909
utilisation_ratio                      0.215000
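Since communality is the row-wise sum of squared loadings over the retained factors, the table above can be reproduced directly from the loadings matrix (in the notebook's variables, `(fa.loadings_ ** 2).sum(axis=1)`). A toy illustration with made-up loadings:

```python
import numpy as np

# illustrative loadings matrix: 3 variables x 2 factors
loadings = np.array([[0.8, 0.1],
                     [0.2, 0.9],
                     [0.5, 0.5]])

# communality of variable 0 is 0.8**2 + 0.1**2 = 0.65, and so on
communalities = (loadings ** 2).sum(axis=1)
```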

4. Customer Segmentation / Clustering¶

In [281]:
#Scaling the dataframe features before clustering
scaler = StandardScaler()
scaled_features=scaler.fit_transform(clean_df)

#Created scaled_df for any reference afterwards.
scaled_df=pd.DataFrame(scaler.fit_transform(clean_df), columns = clean_df.columns)

Elbow Method¶

In [282]:
params = {"init":"random","n_init":10,"max_iter":300,"random_state":0}
cluster_range = range( 1, 21 )
cluster_errors=[]
for k in cluster_range:
    kmeans= KMeans(n_clusters=k, **params) #** unpacks the params dictionary
    kmeans.fit(scaled_features)
    cluster_errors.append(kmeans.inertia_) 
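The `inertia_` collected here is the sum of squared distances from each sample to its assigned centroid, which can be verified by hand on a toy dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# recompute inertia: squared distance of each point to its own centroid
manual = sum(np.sum((X[i] - km.cluster_centers_[km.labels_[i]]) ** 2)
             for i in range(len(X)))
```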
In [283]:
model = KMeans(**params)

# KElbowVisualizer refits the model for each number of clusters k in the given range.
visualizer = KElbowVisualizer(model, k=(2,30), timings= True)
visualizer.fit(scaled_features)        
visualizer.show()        
plt.show()

Knee Method (Kneed)¶

In [284]:
k1 = KneeLocator(range(1,21), cluster_errors , curve='convex', direction= 'decreasing')
plt.plot(range(1, 21), cluster_errors)
plt.xticks(range(1, 21))
plt.xlabel('No. of Clusters')
plt.ylabel('Inertia (Sum of Squared Errors)')
plt.axvline(x = k1.elbow, color='b', ls= '--')
plt.show()

Silhouette Coefficient¶

In [285]:
silhouette_coefficients =[]
#min of clusters is 2
for k in range(2, 21):
    kmeans = KMeans(n_clusters = k, **params)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    silhouette_coefficients.append(score)
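For a single sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. A hand computation on four toy points matches sklearn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

# silhouette for sample 0:
a = np.linalg.norm(X[0] - X[1])                            # mean distance within its cluster
b = np.mean([np.linalg.norm(X[0] - X[i]) for i in (2, 3)]) # mean distance to the other cluster
s0 = (b - a) / max(a, b)
```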
In [286]:
k = [i for i in range(2,21)]
ax = sns.barplot(x = k, y = silhouette_coefficients) # k = 3 seems to have the highest score
ax.set(xlabel = 'Clusters', ylabel = 'Score')
plt.title('Silhouette coefficients score by cluster')
plt.show()
In [287]:
fig, ax = plt.subplots(3, 2, figsize=(20,15))
fig.suptitle('Silhouette Analysis for 2-6 Clusters', size = 18)
plt.tight_layout()

for i in range(2,7):
    km = KMeans(n_clusters = i, init = 'k-means++', n_init = 10, max_iter = 100, random_state = 0)
    q, mod = divmod(i, 2)
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax = ax[q-1][mod])
    visualizer.fit(scaled_features)
    #visualizer.show()
In [288]:
km = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 100, random_state = 0)
visualizer = SilhouetteVisualizer(km, colors='yellowbrick')
visualizer.fit(scaled_features)
visualizer.show()
Out[288]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 8054 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Dendrograms¶

In [289]:
distance = linkage(scaled_features, 'ward')
In [346]:
plt.figure(figsize=(15,5))
plt.title("Hierarchical-Clustering Dendrogram")
plt.xlabel("Index")
plt.ylabel("Ward's Distance")
dendrogram(distance, leaf_rotation=90, leaf_font_size=9); #Takes awhile to run!
#plt.axhline(160, c='k')
plt.show()
#Based on the dendrogram, we cut where the vertical lines are longest, choosing a distance between 140 and 170.
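Cutting the dendrogram at a chosen distance corresponds to scipy's `fcluster` with `criterion='distance'`; on the notebook's linkage matrix this would be `fcluster(distance, t=150, criterion='distance')` for a cut in the 140-170 range. A self-contained sketch on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated blobs of 20 points each
pts = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                 rng.normal(10, 0.5, size=(20, 2))])
Z = linkage(pts, 'ward')

# cut below the final (largest) merge distance -> the two blobs remain separate
labels = fcluster(Z, t=5.0, criterion='distance')
n_clusters = len(np.unique(labels))
```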

K-means Model¶

In [291]:
km_model = KMeans(n_clusters = 4, random_state = 0)
y_pred_km = km_model.fit_predict(scaled_features)
In [292]:
#centroids = pd.DataFrame(data = kmeans.cluster_centers_, columns = [clean_df.columns])
#centroids
In [293]:
clean_df['labels'] = km_model.labels_
In [294]:
clean_df.columns
Out[294]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio', 'labels'],
      dtype='object')
In [295]:
base_features = df.drop(columns = ['avg_monthly_purchase', 'avg_monthly_cash', 
                                   'type_of_purchase']).columns
fig = plt.subplots(nrows = 8, ncols = 3, figsize = (20,20))
counter = 0

for i in base_features:
    plt.subplot(8, 3, counter + 1)
    counter += 1
    ax = sns.barplot(x = clean_df.groupby(clean_df['labels']).mean().index, 
                y = i, data = clean_df.groupby(clean_df['labels']).mean())
    plt.title('Mean ' + str(i) + ' by Cluster')
    plt.tight_layout()
In [296]:
# Plot the histogram of various clusters
for i in clean_df.columns:
    plt.figure(figsize = (30, 5))
    for j in range(clean_df['labels'].nunique()):
        plt.subplot(1, clean_df['labels'].nunique(), j+1)
        ax = clean_df[clean_df['labels'] == j]
        ax[i].hist(bins=20)
        plt.title('{}    \nCLUSTER {} '.format(i, j))
plt.show()
In [297]:
#Lets visualise the clusters in 2D
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_features)
pca_df = pd.DataFrame(data = pca_components, columns = ['pca1', 'pca2'])
pca_df.head()
Out[297]:
pca1 pca2
0 -0.891104 -2.291708
1 -4.047512 3.632017
2 1.101006 0.466706
3 0.027234 -0.096090
4 -1.385667 -1.522292
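A 2-component PCA projection only shows part of the total variation; `explained_variance_ratio_` reports how much each component captures. A quick illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
data[:, 0] *= 5.0          # give one direction dominant variance

pca2 = PCA(n_components=2).fit(data)
# fraction of total variance the 2D plot actually represents
captured = pca2.explained_variance_ratio_.sum()
```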
In [298]:
# Concatenate the clusters labels to the dataframe
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': km_model.labels_})], axis = 1)
pca_df.head()
Out[298]:
pca1 pca2 cluster
0 -0.891104 -2.291708 1
1 -4.047512 3.632017 3
2 1.101006 0.466706 1
3 0.027234 -0.096090 1
4 -1.385667 -1.522292 1
In [347]:
# Visualize the clusters in 2-axes plane
plt.figure(figsize = (8,5))
ax = sns.scatterplot(x="pca1", y="pca2", hue = "cluster", data = pca_df, palette=['orange','brown','dodgerblue', 'green', 'red', 'blue'])
plt.title('PCA Plot for 4 Clusters')
plt.show()

TSNE Plot¶

In [300]:
tsne_model = make_pipeline(StandardScaler(), TSNE(n_components = 2, random_state = 0))
tsne_data = tsne_model.fit_transform(clean_df)

# Create a new data frame to help plot the results
tsne_data = np.vstack((tsne_data.T, clean_df.labels)).T
tsne_df = pd.DataFrame(data = tsne_data, columns=("Dim_1", "Dim_2", "label"))
In [301]:
# Plotting the t-SNE results
sns.FacetGrid(tsne_df, hue="label", height=8, palette=['orange','brown','dodgerblue', 'green', 'red', 'blue']).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.title('t-SNE plot for Customer Segments')
plt.show()

6. Behavioural Analysis¶

In [302]:
#available features

#purchases = ['PURCHASES', 'ONEOFF_PURCHASES',
#        'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE]
#frequency = ['PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 
#                'CASH_ADVANCE_FREQUENCY', 'BALANCE_FREQUENCY']
#monthly = ['avg_monthly_purchase', 'avg_monthly_cash',
#        'avg_monthly_installments']
#others = ['utilisation_ratio','CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT', 'TENURE',
#        'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'BALANCE']
In [303]:
#Adjust Features here
Purchase = ["PURCHASES", "CASH_ADVANCE", "INSTALLMENTS_PURCHASES",
                      'ONEOFF_PURCHASES', 'BALANCE', #'PAYMENTS', 
                      "CREDIT_LIMIT", 'PAYMENTS',  'MINIMUM_PAYMENTS', 'labels']

Count = ['PURCHASES_TRX', 'CASH_ADVANCE_TRX', 'TENURE', 'labels']

Frequency = ['PURCHASES_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
                      'ONEOFF_PURCHASES_FREQUENCY', 'BALANCE_FREQUENCY', 'utilisation_ratio', 'labels']

all_features = [Purchase, Count, Frequency]
In [304]:
counter = 0
for i in all_features:
    name = ''
    if counter == 0:
        name = 'Purchase'
    elif counter == 1:
        name = 'Count'
    else:
        name = 'Frequency'
    counter += 1
    
    clean_df[i].groupby(['labels']).mean().plot.bar(figsize = (15,5))
    plt.title(f'{name} Behaviors of Various Segments')
    plt.xlabel('Segment / Labels');
    plt.show()

clean_df['labels'].value_counts()
Out[304]:
1    3365
0    2278
3    1330
2    1081
Name: labels, dtype: int64
In [305]:
#'avg_monthly_payments' is omitted: if customers use the card at all, they will make payments either way
avg_features = ['avg_monthly_purchase', 'avg_monthly_cash', 
                'avg_monthly_installments', 'labels']
analysis_df = clean_df.copy()

analysis_df[avg_features].groupby("labels").mean().plot.bar(figsize=(15, 5))

plt.title('Average Behavior of Various Segments')
plt.xlabel('Segment / Labels');
plt.show()
In [306]:
# get the values for each col grouped by cluster
behavior_table = clean_df.groupby('labels')[["PURCHASES", "CASH_ADVANCE", "INSTALLMENTS_PURCHASES",
                      'ONEOFF_PURCHASES', 'utilisation_ratio', #'PAYMENTS', 
                      "CREDIT_LIMIT", 'PAYMENTS','MINIMUM_PAYMENTS', 'PURCHASES_TRX', 'CASH_ADVANCE_TRX']]

n_customers = behavior_table.size()
#description['n_customers'] = n_customers
behavior_table = behavior_table.mean()
behavior_table.style.background_gradient(cmap='YlGnBu')
Out[306]:
  PURCHASES CASH_ADVANCE INSTALLMENTS_PURCHASES ONEOFF_PURCHASES utilisation_ratio CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PURCHASES_TRX CASH_ADVANCE_TRX
labels                    
0 642.183982 109.978963 481.368534 161.439056 0.767346 3311.063111 802.161004 469.790589 15.118086 0.500878
1 250.831447 368.145209 38.631076 212.346993 0.586889 3202.318970 813.312449 552.604758 3.240416 1.585438
2 2477.163895 190.412958 947.826614 1529.382609 0.718032 5791.518796 2340.459019 592.806467 36.727105 0.728030
3 200.352737 2941.678073 67.417150 133.012030 0.377535 5612.572340 2167.597754 1465.857342 3.372180 9.601504
In [307]:
avg_behavior_df = clean_df.groupby("labels")[['avg_monthly_purchase', #'avg_monthly_cash', 
                                              'avg_monthly_installments', 'utilisation_ratio']].mean()
avg_behavior_df = pd.DataFrame(MinMaxScaler().fit_transform(avg_behavior_df), 
                            index=avg_behavior_df.index, columns=avg_behavior_df.columns)
avg_behavior_df
Out[307]:
        avg_monthly_purchase  avg_monthly_installments  utilisation_ratio
labels
0                   0.202656                  0.511325           1.000000
1                   0.024689                  0.000000           0.537065
2                   1.000000                  1.000000           0.873494
3                   0.000000                  0.033135           0.000000

7. Multinomial Logistic Regression¶

Base Model¶

In [308]:
X = clean_df.copy()
y = X['labels']
X = X.drop(columns = ['labels'], axis = 1)
In [309]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y, random_state = 0)
In [310]:
#transformer = FactorAnalysis(n_components=3, random_state=0)
In [311]:
y.value_counts().plot(kind = 'bar')
plt.title("Number of Customers per Segment")
plt.show()
In [312]:
lr_model = make_pipeline(StandardScaler(), #transformer, 
                         LogisticRegression(multi_class='auto', random_state = 0))
lr_model.fit(X_train, y_train)
lr_y_pred = lr_model.predict(X_test)
In [313]:
print('Accuracy: ', accuracy_score(y_test, lr_y_pred))
print(classification_report(y_test, lr_y_pred))
cm = confusion_matrix(y_test, lr_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = lr_model.classes_)
disp.plot()
plt.grid(False)
plt.show()
Accuracy:  0.9937926753569212
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       456
           1       0.99      1.00      0.99       673
           2       1.00      1.00      1.00       216
           3       0.99      0.99      0.99       266

    accuracy                           0.99      1611
   macro avg       0.99      0.99      0.99      1611
weighted avg       0.99      0.99      0.99      1611

In [314]:
lr_coef = pd.DataFrame(lr_model.named_steps['logisticregression'].coef_, columns = X.columns).T
# lr_coef = pd.DataFrame(lr_model.named_steps['logisticregression'].coef_, 
#                       columns = lr_model.named_steps['factoranalysis'].get_feature_names_out()).T
lr_bias = pd.DataFrame(lr_model.named_steps['logisticregression'].intercept_, columns = ['Intercept']).T
lr_weights = pd.concat([lr_bias, lr_coef], axis = 0)

fig, ax = plt.subplots(1, 1, figsize=(5, 10))
sns.heatmap(lr_weights, annot = True, cmap='RdBu', ax = ax, fmt ='.2f')
plt.title('Multinomial Logistic Regression Feature Coefficients')
plt.show()

Updated Model (Removed Multicollinearity)¶

Check for Multicollinearity

In [315]:
vif = pd.DataFrame()
vif['feature'] = scaled_df.columns
vif['VIF'] = [variance_inflation_factor(scaled_df.values, i) for i in range(len(scaled_df.columns))]
In [316]:
vif.sort_values("VIF", ascending=False)
Out[316]:
feature VIF
2 PURCHASES 13115.852947
3 ONEOFF_PURCHASES 7664.500999
4 INSTALLMENTS_PURCHASES 3882.218666
17 avg_monthly_purchase 158.916857
19 avg_monthly_installments 153.655325
5 CASH_ADVANCE 47.844883
18 avg_monthly_cash 46.959925
6 PURCHASES_FREQUENCY 13.020626
8 PURCHASES_INSTALLMENTS_FREQUENCY 12.467807
7 ONEOFF_PURCHASES_FREQUENCY 5.106290
9 CASH_ADVANCE_FREQUENCY 4.543645
10 CASH_ADVANCE_TRX 4.114974
11 PURCHASES_TRX 3.156314
0 BALANCE 3.026449
16 TENURE 2.747099
20 utilisation_ratio 2.345546
12 CREDIT_LIMIT 1.910235
1 BALANCE_FREQUENCY 1.497785
13 PAYMENTS 1.470122
15 PRC_FULL_PAYMENT 1.357590
14 MINIMUM_PAYMENTS 1.215975
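Each VIF above is 1 / (1 - R²_i), where R²_i comes from regressing feature i on all the other features; statsmodels' `variance_inflation_factor` computes exactly this. A minimal numpy version:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the remaining columns (with intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
# third column is almost an exact linear combination of the first two
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=300)])

high = vif(X, 2)                        # near-collinear -> very large VIF
low = vif(np.column_stack([a, b]), 0)   # independent features -> VIF near 1
```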
In [317]:
#List of features with VIF >= 15
high_vif = list(vif[vif['VIF'] >= 15]['feature'])

# Remove features with a high VIF
remove_high_vif = [x for x in scaled_df.columns if x not in high_vif]
In [318]:
removals = ['TENURE', 'PRC_FULL_PAYMENT', 'MINIMUM_PAYMENTS', 'BALANCE_FREQUENCY']

for i in removals:
    if i in remove_high_vif:
        remove_high_vif.remove(i)
        
remove_high_vif
Out[318]:
['BALANCE',
 'PURCHASES_FREQUENCY',
 'ONEOFF_PURCHASES_FREQUENCY',
 'PURCHASES_INSTALLMENTS_FREQUENCY',
 'CASH_ADVANCE_FREQUENCY',
 'CASH_ADVANCE_TRX',
 'PURCHASES_TRX',
 'CREDIT_LIMIT',
 'PAYMENTS',
 'utilisation_ratio']
In [319]:
X_vif = clean_df.copy()
X_vif = X_vif[remove_high_vif]
y_vif = clean_df['labels']

X_train_vif, X_test_vif, y_train_vif, y_test_vif = train_test_split(X_vif, y_vif, stratify=y, test_size=0.2, random_state=0)
In [320]:
X_vif.columns
Out[320]:
Index(['BALANCE', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY',
       'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
       'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
       'utilisation_ratio'],
      dtype='object')
In [321]:
vif_model = make_pipeline(StandardScaler(), LogisticRegression(random_state = 0, multi_class = 'multinomial', solver='lbfgs'))
vif_model.fit(X_train_vif, y_train_vif)
vif_y_pred = vif_model.predict(X_test_vif)
In [322]:
print('Accuracy: ', accuracy_score(y_test_vif, vif_y_pred))
print(classification_report(y_test_vif, vif_y_pred))
cm = confusion_matrix(y_test_vif, vif_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = vif_model.classes_)
disp.plot()
plt.grid(False)
plt.show()
Accuracy:  0.8733705772811918
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       456
           1       0.92      0.90      0.91       673
           2       0.86      0.71      0.78       216
           3       0.86      0.87      0.86       266

    accuracy                           0.87      1611
   macro avg       0.87      0.85      0.85      1611
weighted avg       0.87      0.87      0.87      1611

In [323]:
vif_coef = pd.DataFrame(vif_model.named_steps['logisticregression'].coef_, columns = X_vif.columns).T
# lr_coef = pd.DataFrame(lr_model.named_steps['logisticregression'].coef_, 
#                       columns = lr_model.named_steps['factoranalysis'].get_feature_names_out()).T
vif_bias = pd.DataFrame(vif_model.named_steps['logisticregression'].intercept_, columns = ['Intercept']).T
vif_weights = pd.concat([vif_bias, vif_coef], axis = 0)

fig, ax = plt.subplots(1, 1, figsize=(5, 10))
sns.heatmap(vif_weights, annot = True, cmap='RdBu', ax = ax, fmt ='.2f')
plt.title('Multinomial Logistic Regression Feature Coefficients (Removed Multicollinearity)')
plt.show()

8. Decision Tree Classifier¶

In [324]:
dt_model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state = 0))
dt_model.fit(X_train, y_train)
dt_y_pred = dt_model.predict(X_test)
In [325]:
print('Accuracy: ', accuracy_score(y_test, dt_y_pred))
print(classification_report(y_test, dt_y_pred))
cm = confusion_matrix(y_test, dt_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = dt_model.classes_)
disp.plot()
plt.grid(False)
plt.show()
Accuracy:  0.9248913718187461
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       456
           1       0.94      0.93      0.93       673
           2       0.94      0.93      0.93       216
           3       0.87      0.90      0.88       266

    accuracy                           0.92      1611
   macro avg       0.92      0.92      0.92      1611
weighted avg       0.93      0.92      0.93      1611

In [326]:
dt_importance = pd.DataFrame({'Feature': X.columns, 'Importance': dt_model.named_steps['decisiontreeclassifier'].feature_importances_})

dt_importance.sort_values('Importance', ascending=False, inplace=True)

plt.figure(figsize=(5, 7))
sns.barplot(y = 'Feature', x = 'Importance',orient = 'h', data=dt_importance)
plt.title('Feature Importance in the Decision Tree Classifier', fontsize=14)
plt.show()
In [327]:
text_representation = tree.export_text(dt_model.named_steps['decisiontreeclassifier'])
#print(text_representation)
In [328]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt_model.named_steps['decisiontreeclassifier'], max_depth = 3,
                   feature_names=X.columns,  
                   class_names=y.unique().astype(str), fontsize=13,
                   filled=True)

9. Random Forest¶

In [329]:
rf_model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state = 0, max_depth=4))
rf_model.fit(X_train, y_train)
rf_y_pred = rf_model.predict(X_test)
In [330]:
print('Accuracy: ', accuracy_score(y_test, rf_y_pred))
print(classification_report(y_test, rf_y_pred))
cm = confusion_matrix(y_test, rf_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = rf_model.classes_)
disp.plot()
plt.grid(False)
plt.show()
Accuracy:  0.91495965238982
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       456
           1       0.90      0.96      0.93       673
           2       0.94      0.89      0.91       216
           3       0.94      0.84      0.89       266

    accuracy                           0.91      1611
   macro avg       0.92      0.90      0.91      1611
weighted avg       0.92      0.91      0.91      1611

In [331]:
rf_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.named_steps['randomforestclassifier'].feature_importances_})

rf_importance.sort_values('Importance', ascending=False, inplace=True)

plt.figure(figsize=(5, 7))
sns.barplot(y = 'Feature', x = 'Importance',orient = 'h', data=rf_importance)
plt.title('Feature Importance of Random Forest Classifier', fontsize=14)
plt.show()
In [332]:
plt.figure(figsize=(30,15))
plot_tree(rf_model.named_steps['randomforestclassifier'].estimators_[1], filled=True, feature_names = X.columns, fontsize=14)
plt.show()

10. Adapted McKinsey Portfolio Matrix¶

In [333]:
# get the values for each col grouped by cluster
port_matrix_table = clean_df.groupby('labels')[["PURCHASES", "CASH_ADVANCE", "INSTALLMENTS_PURCHASES",
                      'ONEOFF_PURCHASES', 'utilisation_ratio', "PURCHASES_FREQUENCY",
                      "CREDIT_LIMIT", 'PAYMENTS','MINIMUM_PAYMENTS', 'PURCHASES_TRX', 'CASH_ADVANCE_TRX']]

n_customers = port_matrix_table.size()
port_matrix_table = port_matrix_table.mean()
port_matrix_table = port_matrix_table.reset_index()
port_matrix_table
Out[333]:
labels PURCHASES CASH_ADVANCE INSTALLMENTS_PURCHASES ONEOFF_PURCHASES utilisation_ratio PURCHASES_FREQUENCY CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PURCHASES_TRX CASH_ADVANCE_TRX
0 0 642.183982 109.978963 481.368534 161.439056 0.767346 0.837187 3311.063111 802.161004 469.790589 15.118086 0.500878
1 1 250.831447 368.145209 38.631076 212.346993 0.586889 0.189751 3202.318970 813.312449 552.604758 3.240416 1.585438
2 2 2477.163895 190.412958 947.826614 1529.382609 0.718032 0.892663 5791.518796 2340.459019 592.806467 36.727105 0.728030
3 3 200.352737 2941.678073 67.417150 133.012030 0.377535 0.174324 5612.572340 2167.597754 1465.857342 3.372180 9.601504
In [348]:
port_matrix_table['labels'] = [0,1,2,3]
port_matrix_table = port_matrix_table.sort_values(by=['labels'])
port_matrix_table['segments'] = [ 'Active Micro Purchases', 'Conservative Customers', 
                                 'Active Customers', 'Cash Advance Customers']
In [349]:
sns.set_theme(style='whitegrid')
g = sns.scatterplot(data=port_matrix_table, x='PURCHASES_FREQUENCY', y='PAYMENTS', hue='segments', s=800)
g.set_xticks(list(np.linspace(0, 1, 4)))
g.set_xticklabels(['0', 'Low', 'Moderate', 'High'])
g.set_xlabel('Loyalty')

g.set_yticks(list(np.linspace(700, 2400, 4)))
g.set_yticklabels(['0', 'Low', 'Moderate', 'High'])
g.set_ylabel('Profitability')
g.legend(bbox_to_anchor=(1.05, 1))
plt.title('Adapted McKinsey Portfolio Matrix')
plt.show()

11. Appendix¶

#. Linear Discriminant Analysis (Used for Testing)¶

In [336]:
X_train1 = np.array(X_train)
y_train1 = np.array(y_train)

lda_model = make_pipeline(StandardScaler(), #transformer, 
                          LinearDiscriminantAnalysis(n_components = 2))
lda_model.fit(X_train1, y_train1)
Out[336]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('lineardiscriminantanalysis',
                 LinearDiscriminantAnalysis(n_components=2))])
In [337]:
X_test1 = np.array(X_test)
y_test1 = np.array(y_test)

lda_y_pred = lda_model.predict(X_test1)
In [338]:
print("Accuracy: ", accuracy_score(y_test1, lda_y_pred))
print(classification_report(y_test1, lda_y_pred))
cm = confusion_matrix(y_test1, lda_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = lda_model.classes_)
disp.plot()
plt.grid(False)
plt.show()
Accuracy:  0.9267535692116697
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       456
           1       0.91      0.98      0.94       673
           2       0.97      0.87      0.92       216
           3       0.97      0.88      0.92       266

    accuracy                           0.93      1611
   macro avg       0.94      0.91      0.92      1611
weighted avg       0.93      0.93      0.93      1611

In [339]:
lda_coef = pd.DataFrame(lda_model.named_steps['lineardiscriminantanalysis'].coef_, columns = X.columns).T
# lda_coef = pd.DataFrame(lda_model.named_steps['lineardiscriminantanalysis'].coef_, 
#                        columns = lda_model.named_steps['factoranalysis'].get_feature_names_out()).T
lda_bias = pd.DataFrame(lda_model.named_steps['lineardiscriminantanalysis'].intercept_, columns = ['Intercept']).T
lda_weights = pd.concat([lda_bias, lda_coef], axis = 0)

fig, ax = plt.subplots(1, 1, figsize=(12, 10))
sns.heatmap(lda_weights.round(3), annot = True, cmap='RdBu', ax = ax)
plt.suptitle('LDA Feature Coefficients')
plt.show()
In [340]:
pd.Series(np.abs(lda_model.named_steps['lineardiscriminantanalysis'].coef_).sum(axis=1), 
          index=lda_model.named_steps['lineardiscriminantanalysis'].classes_).sort_values().plot.bar(
    figsize=(12, 6), title="LDA Class Coefficient Sums"
)
plt.show()
#Clusters 2 and 0 are much more separable than cluster 1
In [341]:
X_hat = lda_model.fit_transform(X, y)

import matplotlib as mpl

colors = mpl.cm.get_cmap(name='tab20').colors
categories = pd.Categorical(pd.Series(y)).categories

ret = pd.DataFrame(
    {'C1': X_hat[:, 0], 'C2': X_hat[:, 1], 'Type': pd.Categorical(pd.Series(y))}
)

fig, ax = plt.subplots(1, figsize=(12, 6))

for col, cat in zip(colors, categories):
    (ret.query('Type == @cat').plot.scatter(x='C1', y='C2', color=col, label=cat, ax=ax,
                       s=100, edgecolor='black', linewidth=1,
                       title='Two-Component LDA Decomposition').legend(bbox_to_anchor=(1.2, 1))
    )
#The different clusters of credit customers are linearly distinguishable from the numerical features alone.

#. OLS Linear Regression Model¶

In [342]:
clean_df.columns
Out[342]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE', 'avg_monthly_purchase', 'avg_monthly_cash',
       'avg_monthly_installments', 'utilisation_ratio', 'labels'],
      dtype='object')
In [343]:
#The goal is to identify a feature that is distinct from the rest, which the business could then act on.
In [344]:
#Edit here to test out different target variables and the scaling involved.
target = 'CASH_ADVANCE_TRX'
scaling_technique = StandardScaler()
############################

clf_x = clean_df.drop(columns = [target])
clf_y = clean_df[target]

# clf_x = clean_df.copy()
# clf_x = clean_df[remove_high_vif]
# clf_y = clean_df[target]
                    
#Linear Regression Model    
clf_model = make_pipeline(scaling_technique, LinearRegression())
clf_model.fit(clf_x, clf_y)
clf_y_pred = clf_model.predict(clf_x)

#Weights output
clf_coef = pd.DataFrame([clf_model.named_steps['linearregression'].coef_], columns = clf_x.columns).T
clf_bias = pd.DataFrame([clf_model.named_steps['linearregression'].intercept_], columns = ['Intercept']).T
clf_weights = pd.concat([clf_bias, clf_coef], axis = 0)

fig, ax = plt.subplots(1, 1, figsize=(5, 10))
sns.heatmap(clf_weights, annot = True, cmap='RdBu', ax = ax, linewidths = .5, fmt ='.2f')
plt.title(f'Linear Regression of {target}\'s Feature Coefficients')
plt.show()
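Because StandardScaler sits in front of LinearRegression in the pipeline, each coefficient in the heatmap above is the expected change in the target per one standard deviation of that feature, which is what makes the magnitudes comparable across features on very different scales. A minimal, self-contained sketch on synthetic data (feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two toy features on very different scales
demo = pd.DataFrame({
    'small_scale': rng.normal(0, 1, 500),
    'large_scale': rng.normal(0, 1000, 500),
})
# Both features contribute about equally per standard deviation
target_demo = 2.0 * demo['small_scale'] + 0.001 * demo['large_scale']

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(demo, target_demo)

# After scaling, each coefficient is the change in the target per one
# standard deviation of the feature, so magnitudes are comparable
# (roughly 2.0 and 1.0 here) even though the raw slopes differ 2000-fold.
coefs = pd.Series(model.named_steps['linearregression'].coef_, index=demo.columns)
print(coefs)
```

Without the scaler, the raw coefficient on `large_scale` would look negligible purely because of its units, which is why the heatmap is read on the standardized scale.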
In [345]:
sm_model = sm.OLS(clf_y, clf_x).fit()  # no intercept is added here, so the R-squared reported below is uncentered
print(sm_model.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:       CASH_ADVANCE_TRX   R-squared (uncentered):                   0.815
Model:                            OLS   Adj. R-squared (uncentered):              0.815
Method:                 Least Squares   F-statistic:                              1687.
Date:                Sat, 15 Apr 2023   Prob (F-statistic):                        0.00
Time:                        22:08:24   Log-Likelihood:                         -17872.
No. Observations:                8054   AIC:                                  3.579e+04
Df Residuals:                    8033   BIC:                                  3.593e+04
Df Model:                          21                                                  
Covariance Type:            nonrobust                                                  
====================================================================================================
                                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
BALANCE                             -0.0001   2.72e-05     -5.048      0.000      -0.000    -8.4e-05
BALANCE_FREQUENCY                   -0.3233      0.122     -2.650      0.008      -0.562      -0.084
PURCHASES                            0.0011      0.003      0.333      0.739      -0.005       0.007
ONEOFF_PURCHASES                    -0.0005      0.003     -0.164      0.870      -0.007       0.006
INSTALLMENTS_PURCHASES              -0.0005      0.003     -0.159      0.874      -0.007       0.006
CASH_ADVANCE                         0.0021      0.000     19.670      0.000       0.002       0.002
PURCHASES_FREQUENCY                  0.9663      0.232      4.174      0.000       0.512       1.420
ONEOFF_PURCHASES_FREQUENCY          -0.4388      0.209     -2.099      0.036      -0.849      -0.029
PURCHASES_INSTALLMENTS_FREQUENCY    -0.1423      0.225     -0.633      0.527      -0.583       0.298
CASH_ADVANCE_FREQUENCY              19.5912      0.219     89.325      0.000      19.161      20.021
PURCHASES_TRX                       -0.0015      0.003     -0.539      0.590      -0.007       0.004
CREDIT_LIMIT                     -2.085e-05    1.1e-05     -1.890      0.059   -4.25e-05    7.79e-07
PAYMENTS                         -6.602e-05   2.01e-05     -3.279      0.001      -0.000   -2.66e-05
MINIMUM_PAYMENTS                   6.01e-05   1.57e-05      3.824      0.000    2.93e-05    9.09e-05
PRC_FULL_PAYMENT                     0.1194      0.102      1.173      0.241      -0.080       0.319
TENURE                              -0.0322      0.011     -2.903      0.004      -0.054      -0.010
avg_monthly_purchase                -0.0074      0.004     -1.951      0.051      -0.015    3.41e-05
avg_monthly_cash                    -0.0178      0.001    -15.197      0.000      -0.020      -0.015
avg_monthly_installments            -0.0017      0.007     -0.236      0.814      -0.016       0.013
utilisation_ratio                    0.0547      0.093      0.589      0.556      -0.127       0.237
labels                               0.4677      0.044     10.603      0.000       0.381       0.554
==============================================================================
Omnibus:                     8161.353   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           963484.445
Skew:                           4.746   Prob(JB):                         0.00
Kurtosis:                      55.735   Cond. No.                     7.96e+04
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 7.96e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
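The large condition number flagged in note [3] points to multicollinearity, which is unsurprising since PURCHASES is by construction the sum of ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES. The commented-out `remove_high_vif` line above suggests this was explored; a hedged, self-contained sketch of a per-feature VIF check on simulated data with the same exact-sum structure (column names mirror the dataset, values are synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
oneoff = rng.exponential(500, 1000)
installments = rng.exponential(300, 1000)
demo = pd.DataFrame({
    'ONEOFF_PURCHASES': oneoff,
    'INSTALLMENTS_PURCHASES': installments,
    # Near-exact linear combination of the first two columns
    'PURCHASES': oneoff + installments + rng.normal(0, 10, 1000),
    'CASH_ADVANCE': rng.exponential(400, 1000),  # independent feature
})

# statsmodels' VIF regresses each column on all the others;
# VIF > 10 is a common rule-of-thumb flag for multicollinearity
vif = pd.Series(
    [variance_inflation_factor(demo.values, i) for i in range(demo.shape[1])],
    index=demo.columns,
)
print(vif)
```

Features with extreme VIFs (here the sum column) are the usual candidates for dropping before refitting, which would also shrink the condition number reported by statsmodels.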